Biological Pattern Discovery with R Machine Learning Approaches (Zheng Rong Yang)

may not be a constant.

Based on the principle of the kernel method, the principle of using a mutation matrix to quantify how similar two sequences are, which is used in sequence homology alignment, can be considered to measure the similarity between peptides as well [Holm and Sander, 1996]. For instance, to align two protein sequences, an amino acid mutation matrix such as the Dayhoff matrix [Dayhoff and Schwartz, 1978] has been used [Lipman, et al., 1989]. This idea has led to the development of algorithms which employ a mutation matrix for protease cleavage pattern discovery.
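
The matrix-based similarity described above can be written compactly. As a sketch only, assuming two ungapped peptides x and y of equal length L and a mutation matrix M indexed by amino acids (the symbols s, L and M are notation introduced here for illustration and are not taken from the text), one common formulation sums the matrix entries position by position:

```latex
s(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{L} M(x_i, y_i)
```

Each term M(x_i, y_i) is the partial similarity contributed by position i, so two peptides that differ at a single position can still receive different scores depending on which amino acids are involved.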

A mutation matrix is normally derived from a large number of sequences. A substitution entry in a mutation matrix actually reflects the mutation rate between two amino acids. A mutation matrix is commonly derived through investigating thousands of sequences within a biological family. This means that a mutation matrix is perhaps more biologically sound and robust for measuring the similarity between peptides.

There are three factor Xa protease cleaved peptides, x1=IEGRT, x2=IEGRI and x3=IEGRD. When using the binary encoding approach, a binary vector for each amino acid in a peptide has only one entry set as one, leaving the other 19 entries as zeros. Therefore, the pairwise distance between a pair of peptides actually depends on the number of residues that differ between the peptides. The pairwise distance between these three sub-sequences is always one because only one residue is occupied by different amino acids. However, the pairwise homology score (similarity) between these three sub-sequences will not be a constant. Table 3.10 shows a partial amino acid mutation matrix. Suppose this mutation matrix is used to measure the pairwise similarity between these three sub-sequences. The partial similarity between x1 and x2 for the fifth residue (between the amino acid T and the amino acid I) is −2. The partial similarity between x1 and x3 is −4. The partial similarity between x2 and x3 for the fifth residue (between the amino acid I and the amino acid D) is −6. It can be seen that they are not constants at all. This shows a very important concept: the binary encoding approach may not be able to reflect the true biological relationship between peptides.
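
The contrast between the two measures can be checked with a short script. The following is a minimal Python sketch, not the book's own code. It assumes the distance under binary encoding is counted per residue position (a bitwise Hamming distance over the full 100-entry vectors would give 2 for every pair, which is still constant). It uses only the three fifth-residue matrix entries quoted in the text; the rest of Table 3.10 is not reproduced. The function and variable names are hypothetical.

```python
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

peptides = {"x1": "IEGRT", "x2": "IEGRI", "x3": "IEGRD"}

# Only the three entries quoted in the text are known here (assumption:
# the T/D entry belongs to the x1, x3 pair and the I/D entry to x2, x3).
PARTIAL_MATRIX = {
    frozenset("TI"): -2,
    frozenset("TD"): -4,
    frozenset("ID"): -6,
}


def one_hot(residue):
    """Return a 20-entry binary vector with a single one for the residue."""
    return [1 if aa == residue else 0 for aa in AMINO_ACIDS]


def residue_distance(p, q):
    """Count the positions whose binary vectors differ."""
    return sum(one_hot(a) != one_hot(b) for a, b in zip(p, q))


def fifth_residue_similarity(p, q):
    """Look up the matrix entry for the fifth residues of two peptides."""
    return PARTIAL_MATRIX[frozenset((p[4], q[4]))]


# Map each peptide pair to (binary-encoding distance, partial similarity).
results = {}
for (name_a, pep_a), (name_b, pep_b) in combinations(peptides.items(), 2):
    results[(name_a, name_b)] = (
        residue_distance(pep_a, pep_b),
        fifth_residue_similarity(pep_a, pep_b),
    )

for pair, (distance, similarity) in results.items():
    print(pair, "distance:", distance, "partial similarity:", similarity)
```

Every pair has the same distance under binary encoding, while the matrix-based partial similarities differ from pair to pair, which is the point made in the paragraph above.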